Head and summary of the Nhanes 2015/2016 dataset
knitr::kable(head(nhanes2))
| SEQN | Alcohol_Year | Smoked_100 | Gender | Age | Race | Education | Marital_Status | Household_Size | Income_to_Pov | Systolic_BP1 | Diastolic_BP1 | Systolic_BP2 | Diastolic_BP2 | BMXWT | BMXHT | BMXBMI | BMXLEG | BMXARML | BMXARMC | BMXWAIST | HIQ210 | EducationX | agegroup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83732 | Yes | Yes | Male | 62 | White | 5 | Married | 2 | 4.39 | 128 | 70 | 124 | 64 | 94.8 | 184.5 | 27.8 | 43.3 | 43.6 | 35.9 | 101.1 | 2 | College/Uni graduate or above | 60-69 |
| 83733 | Yes | Yes | Male | 53 | White | 3 | Divorced | 1 | 1.32 | 146 | 88 | 140 | 88 | 90.4 | 171.4 | 30.8 | 38.0 | 40.0 | 33.2 | 107.9 | NA | High school graduate | 50-59 |
| 83734 | Yes | Yes | Male | 78 | White | 3 | Married | 2 | 1.51 | 138 | 46 | 132 | 44 | 83.4 | 170.1 | 28.8 | 35.6 | 37.0 | 31.0 | 116.5 | 2 | High school graduate | 70-80+ |
| 83735 | No | No | Female | 56 | White | 5 | Living with partner | 1 | 5.00 | 132 | 72 | 134 | 68 | 109.8 | 160.9 | 42.4 | 38.5 | 37.7 | 38.3 | 110.1 | 2 | College/Uni graduate or above | 50-59 |
| 83736 | No | No | Female | 42 | Black | 4 | Divorced | 5 | 1.23 | 100 | 70 | 114 | 54 | 55.2 | 164.9 | 20.3 | 37.4 | 36.0 | 27.2 | 80.4 | 2 | Some college/Uni | 40-49 |
| 83737 | No | No | Female | 72 | Mexican American | 2 | Separated | 5 | 2.82 | 116 | 58 | 122 | 58 | 64.4 | 150.0 | 28.6 | 34.4 | 33.5 | 31.4 | 92.9 | NA | 9-11th grade | 70-80+ |
summary(nhanes2)
## SEQN Alcohol_Year Smoked_100 Gender
## Min. :83732 Don't know: 3 Don't know: 8 Female:2976
## 1st Qu.:86164 No :1728 No :3406 Male :2759
## Median :88668 Yes :3477 Refused : 2
## Mean :88679 NA's : 527 Yes :2319
## 3rd Qu.:91178
## Max. :93702
##
## Age Race Education
## Min. :18.00 Black :1227 Min. :1.000
## 1st Qu.:32.00 Mexican American:1018 1st Qu.:3.000
## Median :48.00 Other Hispanic : 750 Median :4.000
## Mean :48.05 Other Race : 901 Mean :3.442
## 3rd Qu.:63.00 White :1839 3rd Qu.:4.750
## Max. :80.00 Max. :9.000
## NA's :261
## Marital_Status Household_Size Income_to_Pov Systolic_BP1
## Married :2780 1: 770 Min. :0.000 Min. : 82.0
## Never Married :1004 2:1546 1st Qu.:1.060 1st Qu.:112.0
## Divorced : 579 3:1037 Median :1.980 Median :122.0
## Living with partner: 527 4: 936 Mean :2.403 Mean :125.1
## Widowed : 396 5: 699 3rd Qu.:3.740 3rd Qu.:134.0
## (Other) : 188 6: 379 Max. :5.000 Max. :236.0
## NA's : 261 7: 368 NA's :601 NA's :334
## Diastolic_BP1 Systolic_BP2 Diastolic_BP2 BMXWT
## Min. : 0.00 Min. : 84.0 Min. : 0.00 Min. : 32.40
## 1st Qu.: 62.00 1st Qu.:112.0 1st Qu.: 62.00 1st Qu.: 65.90
## Median : 70.00 Median :122.0 Median : 70.00 Median : 78.20
## Mean : 69.52 Mean :124.8 Mean : 69.35 Mean : 81.34
## 3rd Qu.: 78.00 3rd Qu.:134.0 3rd Qu.: 78.00 3rd Qu.: 92.70
## Max. :120.00 Max. :238.0 Max. :144.00 Max. :198.90
## NA's :334 NA's :200 NA's :200 NA's :69
## BMXHT BMXBMI BMXLEG BMXARML
## Min. :129.7 Min. :14.50 Min. :26.00 Min. :28.20
## 1st Qu.:158.7 1st Qu.:24.30 1st Qu.:36.00 1st Qu.:35.20
## Median :166.0 Median :28.30 Median :38.60 Median :37.10
## Mean :166.1 Mean :29.38 Mean :38.58 Mean :37.15
## 3rd Qu.:173.5 3rd Qu.:33.00 3rd Qu.:41.20 3rd Qu.:39.00
## Max. :202.7 Max. :67.30 Max. :51.50 Max. :47.40
## NA's :62 NA's :73 NA's :390 NA's :308
## BMXARMC BMXWAIST HIQ210
## Min. :17.10 Min. : 58.70 Min. :1.000
## 1st Qu.:29.50 1st Qu.: 87.60 1st Qu.:2.000
## Median :32.70 Median : 98.30 Median :2.000
## Mean :33.11 Mean : 99.57 Mean :1.915
## 3rd Qu.:36.20 3rd Qu.:109.30 3rd Qu.:2.000
## Max. :58.40 Max. :171.60 Max. :9.000
## NA's :308 NA's :367 NA's :1003
## EducationX agegroup
## < 9th grade : 655 18-29 :1192
## 9-11th grade : 643 30-39 : 933
## High school graduate :1186 40-49 : 913
## Some college/Uni :1621 50-59 : 888
## College/Uni graduate or above:1366 60-69 : 917
## 9 : 3 70-80+: 892
## NA's : 261
****3 functions created (saved in lib file). 1. ‘corr_func’ to quickly test correlations on the variables whilst exploring the data). The sole purpose using this rather than the built-in function is so the argument ‘complete.obs’ doesn’t have to be typed each time. 2. ‘grp_func’ to quickly allow the user to input a variable to group by and a variable to analyse which will output the min, max, and mean. 3. ‘prop_func’ takes the following aruguments: dataframe, catagorical group variable, and binary variable to analyse. Then outputs a new dataframe of proportions based on these arguments (split by gender). This will be used in the shiny dashboard.
Below code to firstly find out all the columns containing numeric data (numeric_vars. Then a new data table is created with these values (numeric_data). From this table, all the correlations were worked out wit the ‘cor’ function and saved under a variable names ‘correlations’. A correlation plot was created to clearly show any positive or negative correlations between these variables.
Group by agegrp to look at proportions of people who have smoked 100 cigarettes in thei’r life and people who haven’t. Assume that the the older generations will have a larger proportion of smokers than the youngers ones. Displayed in pie charts
smoke_df = filter(nhanes2, Smoked_100!="Don't know" & Smoked_100!='Refused')
prop = smoke_df %>%
count(.data[['agegroup']], .data[['Smoked_100']], na.rm = TRUE) %>%
mutate(prop = prop.table(n))
smoke = group_by(prop, agegroup) %>% transmute(Smoked_100, Proportion = n/sum(n))
p1 = smoke[1:2,]
p2 = smoke[3:4,]
p3 = smoke[5:6,]
p4 = smoke[7:8,]
pie1 = ggplot(p1, aes(x='', y=Proportion, fill =Smoked_100 )) +
geom_bar(stat='identity', width=1) + labs(title = 'Age - 18-29') +
coord_polar('y', start=0) +
geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
scale_fill_brewer(palette = 'Dark2') +
theme_void()
pie2 = ggplot(p2, aes(x='', y=Proportion, fill =Smoked_100 )) +
geom_bar(stat='identity', width=1) + labs(title = 'Age - 30-39') +
coord_polar('y', start=0) +
geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
scale_fill_brewer(palette = 'Dark2') +
theme_void()
pie3 = ggplot(p3, aes(x='', y=Proportion, fill =Smoked_100)) +
geom_bar(stat='identity', width=1) + labs(title = 'Age - 40-49') +
coord_polar('y', start=0) +
geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
scale_fill_brewer(palette = 'Dark2') +
theme_void()
pie4 = ggplot(p4, aes(x='', y=Proportion, fill =Smoked_100)) +
geom_bar(stat='identity', width=1) + labs(title = 'Age - 50-59') +
coord_polar('y', start=0) +
geom_text(aes(label = percent(Proportion)), position = position_stack(vjust = 0.5), size = 5) +
scale_fill_brewer(palette = 'Dark2') +
theme_void()
grid.arrange(
pie1, pie2, pie3, pie4, nrow=2, ncol = 2, top = 'Proportion of people who have smoked 100 cigarettes'
)
Below some box plots and scatter plots have been made influenced by the correlation plot. We hope to find some interetsing results from this. cor_func and grp_func was also used for futher testing. Based on the results some furhter hypothesis testing will be carried out.
#Box plots
b1 = ggplot(nhanes2, aes(x=agegroup, y=Systolic_BP1, color=Gender)) +
geom_boxplot()
b1
b1_test = filter(nhanes2, Age>69)
grp_func(b1_test, 'Gender', 'Systolic_BP1')
## # A tibble: 2 x 4
## `data[[group_var]]` Mean Max Min
## <fct> <dbl> <dbl> <dbl>
## 1 Female 140. 210 84
## 2 Male 137. 204 92
## Mean Max Min
## 1 138.2204 210 84
#Hypothesis test. 1 to see if men of all ages have higher Systolic BP than females, and
#another to test if females aged 70+ have a higher Systolic BP than men. (test below)
#Filter dataframe to remove below value
df2 = filter(nhanes2, EducationX !='9')
b2 = ggplot(df2, aes(x=EducationX, y=Income_to_Pov, color=Gender)) +
geom_boxplot()
b2
nhanes2$Household_Size = as.factor(nhanes2$Household_Size)
b3 = ggplot(nhanes2, aes(x=Household_Size, y=Income_to_Pov, color=Gender)) +
geom_boxplot()
b3
#Hypothesis test to see if females living alone earn more than men living on their own
#(test below)
b4 = ggplot(nhanes2, aes(x=Race, y=Income_to_Pov, color=Gender)) +
geom_boxplot()
b4
#Scatter plots
s1 = ggplot(nhanes2, aes(x=BMXWAIST, y=BMXBMI, color=Gender)) +
geom_point(alpha=0.2) +
scale_color_manual(breaks = c('Male', 'Female'), values = c('blue', 'red'))
s1
corr_func('BMXWAIST', 'BMXBMI')
## [1] 0.910781
grp_func(nhanes2, 'Gender', 'BMXBMI')
## # A tibble: 2 x 4
## `data[[group_var]]` Mean Max Min
## <fct> <dbl> <dbl> <dbl>
## 1 Female 29.9 67.3 14.5
## 2 Male 28.8 58.8 15.1
## Mean Max Min
## 1 29.3822 67.3 14.5
#Filter dataframe to remove below values
df1 = filter(nhanes2, Smoked_100 !="Don't know" & Smoked_100 !='Refused')
s2 = ggplot(df1, aes(x=Systolic_BP1, y=Systolic_BP2, color = Smoked_100)) +
geom_point(alpha=0.2) +
scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s2
corr_func('Systolic_BP1', 'Systolic_BP2')
## [1] 0.9622873
s3 = ggplot(df1, aes(x=Systolic_BP1, y=BMXBMI, color = Smoked_100)) +
geom_point(alpha=0.2) +
scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s3
corr_func('Systolic_BP1', 'BMXBMI')
## [1] 0.1352012
s4 = ggplot(df1, aes(x=Age, y=Systolic_BP1, color = Smoked_100)) +
geom_point(alpha=0.2) +
scale_color_manual(breaks = c('Yes', 'No'), values = c('blue', 'red'))
s4
corr_func('Age', 'Systolic_BP1')
## [1] 0.4692335
grid.arrange(
s1, s2, s3, s4, nrow=2, ncol = 2, top = 'title'
)
Shiny dashboard with interactive scatter plot and box plot. Interactive table using the prop_func
ui <- fluidPage()
server <- function(input, output) {}
# Define UI
ui <- fluidPage(
titlePanel("Nhanes Dashboard"),
# Sidebar layout with a input and output definitions
sidebarLayout(
# Inputs: Select variables to plot
sidebarPanel(
# Select variable for y-axis
selectInput(inputId = "x", label = "Scatter - Variable in X axis:",
choices = c('Age', 'Systolic_BP1', 'Systolic_BP2',
'Diastolic_BP1', 'Diastolic_BP2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST'),
selected = "Systolic_BP1"), # Select variable for x-axis
selectInput(inputId = "y", label = "Scatter - Variable in Y Axis",
choices = c('Age', 'Income_to_Pov', 'Systolic_BP1', 'Systolic_BP2',
'Diastolic_BP1', 'Diastolic_BP2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST'),
selected = "Systolic_BP2"),
sliderInput("range",
label = "Scatter X interval:",
min = 0, max = 240, value = c(80, 240)),
selectInput(inputId = "x2", label = "Box - Variable in X axis:",
choices = c('agegroup', "Alcohol_Year", "Smoked_100", "Race", "Education", "Marital_Status",
'Household_Size', 'Income_to_Pov'),
selected = "Household_Size"), # Select variable for x-axis
selectInput(inputId = "y2", label = "Box - Variable in Y axis:",
choices = c('Systolic_BP1', 'Systolic_BP2', 'Diastolic_BP1', 'Diastolic_BP2',
'BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', 'Income_to_Pov'),
selected = "Income_to_Pov"),
radioButtons("checkGroup", label = h3("Select data by:"),
choices = list("Gender" = "Gender", "Age Group" = "agegroup",
"Smoked 100" = "Smoked_100", "Alcohol Year" = "Alcohol_Year"),
selected = "Gender"),
),
# Output
mainPanel(
tabsetPanel(type="tab",
tabPanel("Data", plotOutput(outputId = "scatterplot"),
plotOutput(outputId = "boxplot")),
tabPanel("Summary of Nhanes dataset", verbatimTextOutput("summ")),
tabPanel("Group table with proportions", radioButtons("Grp_prop", label = h3("Select data by group:"),
choices = list("Age Group" = "agegroup",
"Race" = "Race", "Education" = "Education",
"Marital Status" = "Marital_Status", "Household Size" = "Household_Size"),
selected = "agegroup"),
radioButtons("analyse_prop", label = h3("Select data by:"),
choices = list("Smoked 100" = "Smoked_100", "Alcohol Year" = "Alcohol_Year"),
selected = "Smoked_100"),dataTableOutput("prop"))),
)
)
)
# Define server function
server <- function(input, output) {
# Create the scatterplot object the plotOutput function is expecting
output$scatterplot <- renderPlot({
nh3 = filter(nhanes2, !!sym(input$checkGroup) != "Don't know" & !!sym(input$checkGroup)!= "Refused")
ggplot(data = nh3, aes_string(x = input$x, y = input$y, group = input$checkGroup,
colour = input$checkGroup)) + geom_point(alpha = 0.5) +
xlim(input$range[1], input$range[2])
})
output$boxplot <- renderPlot({
nh3 = filter(nhanes2, !!sym(input$checkGroup) != "Don't know" & !!sym(input$checkGroup)!= "Refused" & !!sym(input$x2)!= "9" & !!sym(input$x2)!= "77" & !!sym(input$x2)!="Don't know" & !!sym(input$x2)!="Refused")
nh3 = filter(nh3, !is.na(!!sym(input$checkGroup)))
ggplot(data = nh3, aes_string(x = input$x2, y = input$y2,
colour = input$checkGroup)) +geom_boxplot()
})
# You can access the value of the widget with input$checkbox, e.g.
output$value <- renderPrint({ input$checkGroup })
output$range <- renderPrint({ input$slider })
output$prop = renderDataTable({prop_func(nhanes2, input$Grp_prop, input$analyse_prop)})
output$summ = renderPrint({summary(nhanes2)})
}
shinyApp(ui = ui, server = server)
## PhantomJS not found. You can install it with webshot::install_phantomjs(). If it is installed, please make sure the phantomjs executable can be found via the PATH variable.
Hypothesis testing - all below tests based on significance level of 0.05
T-test to see if male who has smoked 100 cigarettes in his live and had at least 12 alcoholic drinks in 1 year has a higher Systolic blood pressure than a male who has not. H0 != H1
test1_1 = filter(nhanes2, Gender=='Male' & Smoked_100=='Yes' & Alcohol_Year=='Yes')
test1_2 = filter(nhanes2, Gender=='Male' & Smoked_100=='No' & Alcohol_Year=='No')
#Test using the t.test function
t.test(test1_1$Systolic_BP1, test1_2$Systolic_BP1, alternative = "two.sided",
var.equal = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: test1_1$Systolic_BP1 and test1_2$Systolic_BP1
## t = 4.6317, df = 692.33, p-value = 4.333e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.701071 6.676093
## sample estimates:
## mean of x mean of y
## 128.8670 124.1784
#Test manually. As the std_dev on each sample are not similar, I have used the
#unpooled approach
t1_n1 = nrow(test1_1)
t1_n2 = nrow(test1_2)
t1_u1 = mean(test1_1$Systolic_BP1, na.rm = TRUE)
t1_u2 = mean(test1_2$Systolic_BP1, na.rm = TRUE)
t1_sig1 = sd(test1_1$Systolic_BP1, na.rm = TRUE)
t1_sig2 = sd(test1_2$Systolic_BP1, na.rm = TRUE)
t_test = (t1_u1-t1_u2)/sqrt(((t1_sig1^2)/t1_n1)+((t1_sig2^2)/t1_n2))
t_test
## [1] 4.730432
#t-score for inbuilt_funtion = 4.6317, t-score worked out manually is 4.73. Both give a
#p-value of <0.00001 which provides us with sufficient evidence to reject the null hypothesis
#and say that a male who has drank and smoked does have a higher systolic blood pressure than
#a male who has not. We can futher back this up by looking at the confidence intervals
#(2.7, 6.67) and seeing that 0 does not fall within this. Will run similar test for only the
#alcohol part of this (below)
#T-test to see if somoen who has at least 12 alcoholic drinks in a year has a higher Systolic
#BP than someone who has not.
test2_1 = filter(nhanes2, Gender=='Male' & Alcohol_Year=='Yes')
test2_2 = filter(nhanes2, Gender=='Male' & Alcohol_Year=='No')
#As the variance is similar (testing below), we will use thr pooled approach
sd(test2_1$Systolic_BP1, na.rm = TRUE)
## [1] 17.61412
sd(test2_2$Systolic_BP1, na.rm = TRUE)
## [1] 17.8955
t.test(test2_1$Systolic_BP1, test2_2$Systolic_BP1, alternative = 'two.sided', var.equal = TRUE)
##
## Two Sample t-test
##
## data: test2_1$Systolic_BP1 and test2_2$Systolic_BP1
## t = 2.275, df = 2446, p-value = 0.02299
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2770041 3.7363882
## sample estimates:
## mean of x mean of y
## 127.4336 125.4269
#P-value of 0.02299 still provides us with enough evidence to reject the null and conclude that
#'on average' people who drink 12 alcoholic drinkls per year have a higher systolic BP than
#those tho don't
#Hypothesis test 3 - Do men of all ages have higher Systolic BP than women.
test3_1 = filter(nhanes2, Gender == 'Male')
test3_2 = filter(nhanes2, Gender == 'Female')
#Testing variancwe below. As the variance differs, we will use the unpooled approach
sd(test3_1$Systolic_BP1, na.rm = TRUE)
## [1] 17.64247
sd(test3_2$Systolic_BP1, na.rm = TRUE)
## [1] 19.06579
t.test(test3_1$Systolic_BP1, test3_2$Systolic_BP1, alternative = 'two.sided', var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: test3_1$Systolic_BP1 and test3_2$Systolic_BP1
## t = 7.4453, df = 5397.1, p-value = 1.12e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.739755 4.698244
## sample estimates:
## mean of x mean of y
## 126.9989 123.2799
#Hypothesis test to see if women >69 have a higher Systolic BP than men of the similar age
test4_1 = filter(nhanes2, Gender == 'Female' & Age > 69)
test4_2 = filter(nhanes2, Gender == 'Male' & Age > 69)
#Testing variancwe below. As the variance is similar, the pooled approach will be used
sd(test4_1$Systolic_BP1, na.rm = TRUE)
## [1] 20.7647
sd(test4_2$Systolic_BP1, na.rm = TRUE)
## [1] 20.46368
t.test(test4_1$Systolic_BP1, test4_2$Systolic_BP1, alternative = 'two.sided', var.equal = TRUE)
##
## Two Sample t-test
##
## data: test4_1$Systolic_BP1 and test4_2$Systolic_BP1
## t = 2.4618, df = 851, p-value = 0.01402
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7044388 6.2458476
## sample estimates:
## mean of x mean of y
## 139.9763 136.5012
#Hypothesis test to see if females living alone earn more than males living alone.
test5_1 = filter(nhanes2, Gender == 'Female' & Household_Size == '1')
test5_2 = filter(nhanes2, Gender == 'Male' & Household_Size == '1')
#Signifincant difference in variance, so we will used the unpooled approach
sd(test5_1$Income_to_Pov, na.rm = TRUE)
## [1] 1.550615
sd(test5_2$Income_to_Pov, na.rm = TRUE)
## [1] 1.598473
t.test(test5_1$Income_to_Pov, test5_2$Income_to_Pov, alternative = 'two.sided', var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: test5_1$Income_to_Pov and test5_2$Income_to_Pov
## t = 0.11781, df = 685.79, p-value = 0.9063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2201615 0.2482691
## sample estimates:
## mean of x mean of y
## 2.279837 2.265783
#With a p-value of 0.9063, we fail to reject the null hypothesis based on a significance
#level of 0.05. Confidence level also includes 0.